3  Lesson 1 - Univariate Viz

4 Background

We’re starting our unit on data visualization or data viz, thus skipping some steps in the data science workflow. Mainly, it’s tough to understand how our data should be prepared before we have a sense of what we want to do with this data!

Source




Learning goals

Warm-up (together)

  • Convince ourselves about the importance of data viz.
  • Explore the “grammar of graphics”.

Exercises (in groups)

  • Explore how to create data viz in RStudio.
  • Understand the different basic univariate visualizations for categorical and quantitative variables.





Additional resources

Watch:

Read:





5 Warm-up

5.1 The importance of visualizations

EXAMPLE 1

The data below includes information on hiking trails in the 46 “high peaks” in the Adirondack mountains of northern New York state. This includes data on the hike’s highest elevation (feet), vertical ascent (feet), length (miles), time in hours that it takes to complete, and difficulty rating. Open this data in a viewer, through the Environment tab or by typing View(hikes) in the console.

# Import data
hikes <- read.csv("https://mac-stat.github.io/data/high_peaks.csv")

Tell me about the patterns and trends in hiking trail elevation. What about the the relationship between a hike’s elevation and the typical time it takes to summit / reach the top?

The higher elevation mountains are at the top of the chart and the lower elevation ones are at the bottom. Based on the table, I am not seeing any distinct correlations between a mountains elevation and the time it takes to summit.





EXAMPLE 2

What if this New York Times article tried telling this story without using data viz? What would that story be like?





Benefits of visualization

  • Understand what we’re working with:
    • scales & typical outcomes
    • outliers, i.e. unusual cases
    • patterns & relationships
  • Refine research questions & inform next steps of our analysis.
  • Communicate our findings and tell a story.





5.2 Components of data graphics

EXAMPLE 3

Data viz is the process of mapping data to different plot components. For example, in the NYT example above, the research team mapped data like the following (but with many more rows!) to the plot:

observation decade year date relative temp
1 2020-30 2023 1/23 1.2
2 1940-60 1945 3/45 -0.05

Write down step-by-step directions to use a data table like this one to create the temperature visualization. A computer is your audience. Thus be as precise as possible, but trust that the computer can find the exact numbers if you tell it where.





COMPONENTS OF GRAPHICS

In data viz, we essentially start with a blank canvas and then map data onto it. There are multiple possible mapping components. Some basics from Wickham (which goes into more depth):

  • a frame, or coordinate system
    The variables or features that define the axes and gridlines of the canvas.

  • a layer
    The geometric elements (e.g. lines, points) we add to the canvas to represent either the data points themselves or patterns among the data points. Each type of geometric element is a separate layer. These geometric elements are sometimes called “geoms” or “glyphs” (like heiroglyph!)

  • scales
    The aesthetics we might add to geometric elements (e.g. color, size, shape) to incorporate additional information about data scales or groups.

  • faceting
    The splitting up of the data into multiple subplots, or facets, to examine different groups within the data.

  • a theme
    Additional controls on the “finer points” of the plot aesthetics, (e.g. font type, background, color scheme).





EXAMPLE

In the NYT graphic, the data was mapped to the plot as follows:

  • frame: x-axis = date, y-axis = temp
  • layers: add one line per year, add dots for each month in 2023
  • scales: color each line by decade
  • faceting: none
  • a theme: NYT style





5.3 ggplot + R packages

We will use the powerful ggplot tools in RStudio to build (most of) our viz. The gg here is short for the “grammar of graphics”. These tools are developed in a way that:

  • recognizes that code is communication (it has a grammar!)
  • connects code to the components / philosophy of data viz



EXAMPLE: ggplot in the news



To use these tools, we must first get them into R/RStudio! Recall that R is open source. Anybody can build R tools and share them through special R packages. The tidyverse package compiles a set of individual packages, including ggplot2, that share a common grammar and structure. Though the learning curve can be steep, this grammar is intuitive and generalizable once mastered. Image source: Posit BBC on X

Follow the directions below to install this package, the directions depending upon whether or not you’re working on Mac’s server. Unless the authors of a package add updates, you only need to do this once all semester. To install:

  • If you’re working on Mac’s RStudio server
    tidyverse is already installed on the server! Check this 2 ways.
    • Type library(tidyverse) in your console. If you don’t get an error, it’s installed!
    • Check that it appears in the list under the “Packages” tab (bottom right pane).
  • If you’re working with a desktop version of R/RStudio
    In the “Packages” tab (bottom right pane), click “Install”. From there type the name of the package (tidyverse), make sure the “Install dependencies” box is checked, and click “Install”.





6 Exercises

Goals

We’ll talk more later about “good” data viz, and steps to think about when building viz. In these exercises, the goal is to:

  • Familiarize yourself with the ggplot() structure and grammar.
  • Build univariate viz, i.e. viz for 1 variable at a time.
  • Start recognizing the different approaches for visualizing categorical vs quantitative variables.



Directions

  • General
    • Be kind to yourself.
    • Collaborate with and be kind to others. You are expected to work together as a group.
    • Ask questions. Remember that we won’t discuss these exercises as a class.
  • Activity specific
    • The best way to learn ggplot is to just play around. Focus on the patterns and potential of the code. Don’t worry about memorizing anything! You will naturally start to remember the most important / common code the more and more you use it.



Exercise 1: Research questions

Let’s dig into the hikes data, starting with the elevation and difficulty ratings of the hikes:

head(hikes)
             peak elevation difficulty ascent length time    rating
1     Mt. Marcy        5344          5   3166   14.8 10.0  moderate
2 Algonquin Peak       5114          5   2936    9.6  9.0  moderate
3   Mt. Haystack       4960          7   3570   17.8 12.0 difficult
4   Mt. Skylight       4926          7   4265   17.9 15.0 difficult
5 Whiteface Mtn.       4867          4   2535   10.4  8.5      easy
6       Dix Mtn.       4857          5   2800   13.2 10.0  moderate
  1. What features would we like a visualization of the categorical difficulty rating variable to capture?

A bar graph would best visualize the rating of the categorical variable difficulties.

  1. What about a visualization of the quantitative elevation variable?

A histogram would best visualize the quantitative variable elevations.



Exercise 2: Load tidyverse

We’ll address the above questions using ggplot tools. Try running the following chunk and simply take note of the error message – this is one you’ll get a lot!

# Use the ggplot function
ggplot(hikes, aes(x = rating))

In order to use ggplot tools, we have to first load the tidyverse package in which they live. Mainly, we’ve installed the package but need to tell R when we want to use it. Run the chunk below to load the library. You’ll need to do this within any .qmd file that uses ggplot().

# Load the package
library(tidyverse)





Exercise 3: Bar chart of ratings (part 1)

Consider some specific research questions about the difficulty rating of the hikes:

How many hikes fall into each category? Are the hikes evenly distributed among these categories, or are some more common than others?

All of these questions can be answered with: (1) a bar chart; of (2) the categorical data recorded in the rating column. First, set up the plotting frame:

ggplot(hikes, aes(x = rating))

Think about:

  • What did this do? What do you observe?
  • What, in general, is the first argument of the ggplot() function?
  • What is the purpose of writing x = rating?
  • What do you think aes stands for?!?





Exercise 4: Bar chart of ratings (part 2)

Now let’s add a geometric layer to the frame / canvas, and start customizing the plot’s theme. To this end, try each chunk below, one by one. In each chunk, make a comment about how both the code and the corresponding plot both changed.

NOTE:

  • Pay attention to the general code properties and structure, not memorization.
  • Not all of these are “good” plots. We’re just exploring ggplot.
# The data was added within the frame for difficulty ratings of each hike. 
ggplot(hikes, aes(x = rating)) +
  geom_bar()

# This data is comparing the number of hikes and the difficulty rating. 
ggplot(hikes, aes(x = rating)) +
  geom_bar() +
  labs(x = "Rating", y = "Number of hikes")

# This histogram is comparing the number of hikes and difficulty rating and simultaneously filling the bars with blue. 
ggplot(hikes, aes(x = rating)) +
  geom_bar(fill = "blue") +
  labs(x = "Rating", y = "Number of hikes")

# This histogram does the same as the last chart but outlines the bars in orange. 
ggplot(hikes, aes(x = rating)) +
  geom_bar(color = "orange", fill = "blue") +
  labs(x = "Rating", y = "Number of hikes")

# This turned the background white. 
ggplot(hikes, aes(x = rating)) +
  geom_bar(color = "orange", fill = "blue")  +
  labs(x = "Rating", y = "Number of hikes") +
  theme_minimal()





Exercise 5: Bar chart follow-up

Part a

Reflect on the ggplot() code.

  • What’s the purpose of the +? When do we use it?

    The plus is to add layers to your ggplot.

  • We added the bars using geom_bar()? Why “geom”?

    Geom refers to the geometric object we use to visualize the data.

  • What does labs() stand for?

    Labels, you can modify ggplot labels with this function.

  • What’s the difference between color and fill?

    Color outlines while fill fills in the entire shape.

Part b

In general, bar charts allow us to examine the following properties of a categorical variable:

  • observed categories: What categories did we observe?

    We observed rating.

  • variability between categories: Are observations evenly spread out among the categories, or are some categories more common than others?

    The moderate category is more common, so they are not spread evenly.

We must then translate this information into the context of our analysis, here hikes in the Adirondacks. Summarize here what you learned from the bar chart, in context.

Most hikes in the Adirondacks are moderate, second most are easy, and the least amount of hikes are difficult.

Part c

Is there anything you don’t like about this barplot? For example: check out the x-axis again.





Exercise 6: Sad bar chart

Let’s now consider some research questions related to the quantitative elevation variable:

Among the hikes, what’s the range of elevation and how are the hikes distributed within this range (e.g. evenly, in clumps, “normally”)? What’s a typical elevation? Are there any outliers, i.e. hikes that have unusually high or low elevations?

Here:

  • Construct a bar chart of the quantitative elevation variable.
  • Explain why this might not be an effective visualization for this and other quantitative variables. (What questions does / doesn’t it help answer?)

This is not helpful for mapping quantitative variables because it can only map a singular variable.

ggplot(hikes, aes(x = elevation)) +
  geom_bar()





Exercise 7: A histogram of elevation

Quantitative variables require different viz than categorical variables. Especially when there are many possible outcomes of the quantitative variable, it’s typically insufficient to simply count up the number of times we’ve observed a particular outcome (as the bar graph did above). It gives us a sense of ranges and typical outcomes, but not a good sense of how the observations are distributed across this range. We’ll explore two methods for graphing quantitative variables: histograms and density plots.

Histograms are constructed by (1) dividing up the observed range of the variable into ‘bins’ of equal width; and (2) counting up the number of cases that fall into each bin. Check out the example below:

Part a

Let’s dig into some details.

  • How many hikes have an elevation between 4500 and 4700 feet?

About 6.

  • How many total hikes have an elevation of at least 5100 feet?

3 or 4.

Part b

Now the bigger picture. In general, histograms allow us to examine the following properties of a quantitative variable:

  • typical outcome: Where’s the center of the data points? What’s typical?
  • variability & range: How spread out are the outcomes? What are the max and min outcomes?
  • shape: How are values distributed along the observed range? Is the distribution symmetric, right-skewed, left-skewed, bi-modal, or uniform (flat)?
  • outliers: Are there any outliers, i.e. outcomes that are unusually large/small?

We must then translate this information into the context of our analysis, here hikes in the Adirondacks. Addressing each of the features in the above list, summarize here what you learned from the histogram, in context.





Exercise 8: Building histograms (part 1)

2-MINUTE CHALLENGE: Thinking of the bar chart code, try to intuit what line you can tack on to the below frame of elevation to add a histogram layer. Don’t forget a +. If it doesn’t come to you within 2 minutes, no problem – all will be revealed in the next exercise.

ggplot(hikes, aes(x = elevation)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.





Exercise 9: Building histograms (part 2)

Let’s build some histograms. Try each chunk below, one by one. In each chunk, make a comment about how both the code and the corresponding plot both changed.

# Count is plotted on the y-axis and elevation is plotted on the x-axis.
ggplot(hikes, aes(x = elevation)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The individual bar shapes were outlined in white.
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The individual bar shapes were filled in with blue.
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white", fill = "blue") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The axis labels were re-named to Number of Hikes and Elevation (feet).
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white") +
  labs(x = "Elevation (feet)", y = "Number of hikes")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The individual bar shapes were merged. 
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white", binwidth = 1000) +
  labs(x = "Elevation (feet)", y = "Number of hikes")

# The individual bars were broken up into more. 
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white", binwidth = 5) +
  labs(x = "Elevation (feet)", y = "Number of hikes")

# The bar shapes width were changed to 200 which created a more readable plot. 
ggplot(hikes, aes(x = elevation)) +
  geom_histogram(color = "white", binwidth = 200) +
  labs(x = "Elevation (feet)", y = "Number of hikes")





Exercise 10: Histogram follow-up

  • What function added the histogram layer / geometry?
geom_histogram
function (mapping = NULL, data = NULL, stat = "bin", position = "stack", 
    ..., binwidth = NULL, bins = NULL, na.rm = FALSE, orientation = NA, 
    show.legend = NA, inherit.aes = TRUE) 
{
    layer(data = data, mapping = mapping, stat = stat, geom = GeomBar, 
        position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
        params = list2(binwidth = binwidth, bins = bins, na.rm = na.rm, 
            orientation = orientation, pad = FALSE, ...))
}
<bytecode: 0x11cc55438>
<environment: namespace:ggplot2>
  • What’s the difference between color and fill?

Color outlines whereas fill fills in the shapes.

  • Why does adding color = "white" improve the visualization?

It distinguishes the individual bins more.

  • What did binwidth do?

It changed the width of the bins.

  • Why does the histogram become ineffective if the binwidth is too big (e.g. 1000 feet)?

It swallows too much data and over generalizes the data points.

  • Why does the histogram become ineffective if the binwidth is too small (e.g. 5 feet)?

There are too many data points now and they are hard to see.





Exercise 11: Density plots

Density plots are essentially smooth versions of the histogram. Instead of sorting observations into discrete bins, the “density” of observations is calculated across the entire range of outcomes. The greater the number of observations, the greater the density! The density is then scaled so that the area under the density curve always equals 1 and the area under any fraction of the curve represents the fraction of cases that lie in that range.

Check out a density plot of elevation. Notice that the y-axis (density) has no contextual interpretation – it’s a relative measure. The higher the density, the more common are elevations in that range.

ggplot(hikes, aes(x = elevation)) +
  geom_density()

Questions

  • INTUITION CHECK: Before tweaking the code and thinking back to geom_bar() and geom_histogram(), how do you anticipate the following code will change the plot?

    • geom_density(color = "blue")

    This will outline the line in blue.

    • geom_density(fill = "orange")

    This will fill everything under the line with orange.

  • TRY IT! Test out those lines in the chunk below. Was your intuition correct?

ggplot(hikes, aes(x = elevation)) +
  geom_density(color = "blue", fill = "orange")

  • Examine the density plot. How does it compare to the histogram? What does it tell you about the typical elevation, variability / range in elevations, and shape of the distribution of elevations within this range?





Exercise 12: Density plots vs histograms

The histogram and density plot both allow us to visualize the behavior of a quantitative variable: typical outcome, variability / range, shape, and outliers. What are the pros/cons of each? What do you like/not like about each?





Exercise 13: Code = communication

We obviously won’t be done until we talk about communication. All code above has a similar general structure (where the details can change):

ggplot(___, aes(x = ___)) + 
  geom___(color = "___", fill = "___") + 
  labs(x = "___", y = "___")
  • Though not necessary to the code working, it’s common, good practice to indent or tab the lines of code after the first line (counterexample below). Why?
# YUCK
ggplot(hikes, aes(x = elevation)) +
geom_histogram(color = "white", binwidth = 200) +
labs(x = "Elevation (feet)", y = "Number of hikes")

  • Though not necessary to the code working, it’s common, good practice to put a line break after each + (counterexample below). Why?
# YUCK 
ggplot(hikes, aes(x = elevation)) + geom_histogram(color = "white", binwidth = 200) + labs(x = "Elevation (feet)", y = "Number of hikes")





Exercise 14: Practice

Part a

Practice your viz skills to learn about some of the variables in one of the following datasets from the previous class:

# Data on students in this class
survey <- read.csv("https://mac-stat.github.io/data/high_peaks.csv")

# World Cup data
world_cup <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-11-29/worldcups.csv")
ggplot(world_cup, aes(x = host)) +
  geom_bar(color = "white") +
  labs(x = "World Cup Host", y = "Number of Times Hosted")

Part b

Check out the RStudio Data Visualization cheat sheet to learn more features of ggplot.